Linear Regression

Core Concept

Linear regression models the target variable as a linear combination of the features plus an intercept (bias). For a single target (y) and feature vector (\mathbf{x}), the model is (y = \mathbf{w}^\top \mathbf{x} + b) (or (y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)). The weights define a hyperplane in feature space and are chosen to minimize a loss over the training set, typically the sum of squared errors (SSE) or mean squared error (MSE); under squared-error loss there is a unique closed-form solution (the normal equation) when the design matrix has full column rank. This is the foundational regression approach: interpretable coefficients, fast training, and a single global fit that is well understood statistically and serves as a baseline for more flexible methods.
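
As a concrete illustration, the sketch below fits this model with the normal equation on synthetic data and reads off the intercept and weights. It assumes only NumPy; the data, seed, and coefficient values are made up for the example.

```python
import numpy as np

# Illustrative synthetic data: 100 samples, 2 features (values are made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept b is learned as an extra weight.
X_design = np.column_stack([np.ones(len(X)), X])

# Normal equation: w = (X^T X)^{-1} X^T y (solve is used instead of an explicit inverse).
w = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print("intercept and weights:", w)   # approximately [5.0, 3.0, -2.0]

# Predictions are just the linear combination w^T x + b.
y_hat = X_design @ w
```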

Key Characteristics

  • Closed-form solution – Under squared-error loss, the optimal weights are given by the normal equation (\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}), assuming (\mathbf{X}^\top \mathbf{X}) is invertible. No iterative optimization is required for the basic formulation, though gradient descent is used for large-scale or regularized variants.
  • Interpretability – Each coefficient (\beta_j) can be read as the expected change in the target per unit change in (x_j), holding other features constant. Sign and magnitude of coefficients support feature importance and causal-style reasoning, subject to correlation and confounding.
  • Single global fit – One set of weights applies everywhere in the feature space; the model cannot capture different slopes or curvature in different regions unless features are engineered (e.g. interactions, polynomial terms) or the model is extended (e.g. piecewise linear).
  • Assumptions – Classical inference (standard errors, confidence intervals) assumes linearity, independence of errors, homoscedasticity (constant error variance), and often normality of errors. Violations affect inference more than the fitted values; robust or heteroscedasticity-consistent standard errors can relax some assumptions.
  • Regularization – Ridge (L2) and Lasso (L1) add penalties on (\mathbf{w}), shrinking coefficients or performing feature selection; they improve generalization when (p) is large or features are correlated. Ridge retains a closed-form solution (a modified normal equation), while Lasso has no closed form under the L1 penalty and is solved iteratively (e.g. coordinate descent); a short sketch contrasting the OLS and ridge closed forms follows this list.
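
The sketch below contrasts the two closed forms noted above: the OLS normal equation and the ridge solution with its (\lambda \mathbf{I}) term. The correlated synthetic data and the penalty strength are illustrative assumptions, not values from the text.

```python
import numpy as np

# Two nearly collinear features (values are made up for the example).
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=200)

lam = 10.0                      # ridge penalty strength (illustrative)
I = np.eye(X.shape[1])

# OLS: (X^T X)^{-1} X^T y -- ill-conditioned when features are highly correlated.
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge: (X^T X + lam I)^{-1} X^T y -- still closed-form, coefficients are shrunk.
w_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)

print("OLS  :", w_ols)          # individual coefficients can be large and unstable
print("Ridge:", w_ridge)        # shrunk and far more stable across resamples
```

The intercept is omitted here because the simulated features and target are roughly centered; in practice either center the data or add an (unpenalized) intercept column.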

Common Applications

  • Demand and sales forecasting – Predicting quantity sold or revenue from price, promotion, seasonality, and other covariates
  • Housing and asset valuation – Estimating price from size, location, number of rooms, and similar attributes
  • Risk and exposure modeling – Predicting continuous risk scores or exposure levels from demographic and behavioral features
  • Trend and time-index regression – Modeling a quantity as a linear function of time or an index when the relationship is approximately linear
  • Causal and policy analysis – Estimating treatment effects or policy impacts when linearity and identification assumptions hold; coefficients support interpretable comparison across groups or conditions
  • Baseline and residual analysis – Using linear regression as a simple baseline; examining residuals to guide feature engineering or choice of more flexible models

Linear Regression Algorithms

Linear regression algorithms differ in how they minimize the loss (closed-form vs iterative), whether they apply regularization (L2, L1, or both), and in their robustness to outliers, multicollinearity, and scale. The choice depends on sample size, number of features, the need for interpretability or sparsity, and whether uncertainty estimates are required. Short code sketches for several of these estimators follow the list.

  • Ordinary Least Squares (OLS) – Minimizes sum of squared errors via the normal equation (\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}); closed-form, unique solution when (\mathbf{X}^\top \mathbf{X}) is invertible; interpretable coefficients and standard errors under classical assumptions.

  • Ridge Regression – OLS with L2 penalty (\lambda \|\mathbf{w}\|_2^2); shrinks coefficients toward zero, improving conditioning when features are correlated or (p) is large; solution remains closed-form with ((\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}).

  • Lasso Regression – OLS with L1 penalty (\lambda \|\mathbf{w}\|_1); promotes sparse solutions (some coefficients exactly zero), performing feature selection; solved iteratively (e.g. coordinate descent) since no closed-form solution exists under L1.

  • Elastic Net – Combines L1 and L2 penalties; balances Ridge’s stability under correlation with Lasso’s sparsity; useful when many correlated features and group or sparse selection is desired.

  • Gradient Descent – Iteratively updates weights by moving in the direction that reduces MSE (or other loss); used when the design matrix is too large for the normal equation or when combined with regularization that has no closed-form.

  • Stochastic Gradient Descent (SGD) – Gradient descent using a single example or mini-batch per update; scales to very large (n); requires tuning learning rate and schedule; sklearn’s SGDRegressor supports MSE, Huber, and epsilon-insensitive losses with optional L2/L1 penalty.

  • Huber Regression – Minimizes a loss that is quadratic for small errors and linear for large ones; more robust to outliers than MSE while remaining differentiable; solved iteratively (e.g. iteratively reweighted least squares or SGD).

  • RANSAC (RANdom SAmple Consensus) – Fits a linear model to random subsets of the data and keeps the model with the most inliers; robust to a large fraction of outliers; useful when the data may contain many anomalous points.

  • Bayesian Linear Regression – Places a prior on the weights and updates to a posterior given the data; provides predictive uncertainty (e.g. posterior predictive distribution) and principled handling of regularization via the prior; closed-form under Gaussian prior and likelihood.
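
The sketches below illustrate a few of the estimators listed above, using scikit-learn and NumPy; all data, penalty strengths, and hyperparameters are illustrative assumptions rather than recommendations. First, the regularized variants: ridge shrinks all coefficients, while Lasso and Elastic Net drive some exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

# Illustrative data: 10 features, only 3 of which actually matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
true_w = np.array([4.0, 0.0, -3.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=300)

models = {
    "ols":         LinearRegression(),
    "ridge":       Ridge(alpha=1.0),                    # L2: shrinks, keeps all features
    "lasso":       Lasso(alpha=0.1),                    # L1: sets some coefficients to zero
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5), # mix of L1 and L2
}
for name, model in models.items():
    model.fit(X, y)
    nonzero = int(np.sum(np.abs(model.coef_) > 1e-6))
    print(f"{name:12s} nonzero coefficients: {nonzero}")
```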
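
Next, a plain batch gradient-descent loop for the MSE loss, as described in the Gradient Descent and SGD entries; the learning rate and iteration count are illustrative and would normally be tuned (or replaced by sklearn's SGDRegressor for large data).

```python
import numpy as np

# Synthetic data (values are made up for the example).
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=500)

n, p = X.shape
w = np.zeros(p)   # weights
b = 0.0           # intercept
lr = 0.1          # learning rate (illustrative)

for _ in range(500):
    residual = X @ w + b - y                 # prediction error for every sample
    grad_w = (2.0 / n) * (X.T @ residual)    # gradient of MSE w.r.t. the weights
    grad_b = (2.0 / n) * residual.sum()      # gradient of MSE w.r.t. the intercept
    w -= lr * grad_w
    b -= lr * grad_b

print("weights:", w, "intercept:", b)        # approaches [1.5, -2.0, 0.5] and 4.0
```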
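
For the robust options (Huber and RANSAC entries above), the sketch below compares slopes on data with a corrupted subset of targets; the outlier fraction and magnitudes are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, RANSACRegressor

# One feature, true slope 2.0, with ~10% of targets corrupted by large outliers.
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.3, size=200)
outlier_idx = rng.choice(200, size=20, replace=False)
y[outlier_idx] += 30.0

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)                  # quadratic near zero, linear in the tails
ransac = RANSACRegressor(random_state=0).fit(X, y)  # default base estimator is OLS

print("OLS slope   :", ols.coef_[0])                # pulled upward by the outliers
print("Huber slope :", huber.coef_[0])              # much closer to the true slope of 2
print("RANSAC slope:", ransac.estimator_.coef_[0])  # fit on the consensus inlier set
```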
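
Finally, a conjugate Bayesian linear regression in closed form, as in the last entry above: a Gaussian prior on the weights and Gaussian noise give a Gaussian posterior and a predictive variance. The noise and prior variances are assumed known here purely for illustration (sklearn's BayesianRidge estimates them from the data instead).

```python
import numpy as np

# Synthetic data (values are made up for the example).
rng = np.random.default_rng(5)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + rng.normal(scale=0.2, size=50)

sigma2 = 0.2 ** 2   # noise variance (assumed known, illustrative)
tau2 = 1.0          # prior variance of each weight (illustrative)

# Posterior over weights: N(m, S) with
#   S = (X^T X / sigma^2 + I / tau^2)^{-1}   and   m = S X^T y / sigma^2.
S = np.linalg.inv(X.T @ X / sigma2 + np.eye(X.shape[1]) / tau2)
m = S @ X.T @ y / sigma2

# Posterior predictive at a new point x*: mean m^T x*, variance x*^T S x* + sigma^2.
x_new = np.array([0.5, 2.0])
pred_mean = m @ x_new
pred_std = np.sqrt(x_new @ S @ x_new + sigma2)
print("posterior mean weights:", m)
print(f"prediction: {pred_mean:.3f} +/- {pred_std:.3f}")
```

Note that with this prior the posterior mean coincides with the ridge solution for (\lambda = \sigma^2 / \tau^2), which is how the prior plays the role of regularization.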